Multi-way Tensor Factorization for Unsupervised Lexical Acquisition

Authors

  • Tim Van de Cruys
  • Laura Rimell
  • Thierry Poibeau
  • Anna Korhonen
Abstract

This paper introduces a novel method for joint unsupervised acquisition of verb subcategorization frame (SCF) and selectional preference (SP) information. Treating SCF and SP induction as a multi-way co-occurrence problem, we use multi-way tensor factorization to cluster frequent verbs from a large corpus according to their syntactic and semantic behaviour. The method extends previous tensor factorization approaches by predicting whether a syntactic argument is likely to occur with a verb lemma (SCF) as well as which lexical items are likely to occur in the argument slot (SP), and integrates a variety of lexical and syntactic features, including co-occurrence information on grammatical relations not explicitly represented in the SCFs. The SCF lexicon that emerges from the clusters achieves an F-score of 68.7 against a gold standard, while the SP model achieves an accuracy of 77.8 in a novel evaluation that considers all of a verb’s arguments simultaneously.

Title and abstract in French

Factorisation de tenseurs à plusieurs dimensions pour l’acquisition lexicale non supervisée

Cet article présente une méthode originale pour l’acquisition simultanée de cadres de sous-catégorisation (subcategorization frames) et de restrictions de sélection (selectional preferences) appliquée au lexique verbal. L’induction simultanée de ces deux types d’information est vue comme un problème de cooccurrence à plusieurs dimensions. On introduit donc une méthode de factorisation de tenseurs, afin de classer les verbes fréquents d’un grand corpus suivant leur comportement syntaxique. L’approche est fondée sur un ensemble de traits de nature syntaxique et lexicale, y compris des informations de cooccurrence au sein des relations grammaticales qui ne sont pas explicitement représentées dans les schémas de sous-catégorisation. Le dictionnaire de sous-catégorisation produit par la méthode de classification obtient une F-mesure de 68,7 lors de l’évaluation face à un dictionnaire de référence tandis que les restrictions de sélection ont une exactitude (accuracy) de 77,8 en tenant compte de tous les arguments simultanément.
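
The idea of treating verb behaviour as a multi-way co-occurrence problem can be illustrated with a small sketch. The abstract does not specify the factorization algorithm, the feature set, or the tensor layout, so everything below is an assumption made for illustration: a three-way verb × subject × object count tensor, a non-negative CP (PARAFAC) factorization fitted with multiplicative updates in NumPy, and invented toy counts. It is not the authors' implementation.

```python
import numpy as np

def khatri_rao(a, b):
    """Column-wise Kronecker product; rows are indexed (i, j) with j fastest."""
    rank = a.shape[1]
    return (a[:, None, :] * b[None, :, :]).reshape(-1, rank)

def unfold(t, mode):
    """Mode-n unfolding: the chosen mode becomes the rows (C-order columns)."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def nonneg_cp(t, rank, n_iter=200, seed=0, eps=1e-9):
    """Non-negative CP factorization via multiplicative updates (illustrative)."""
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) + 0.1 for dim in t.shape]
    for _ in range(n_iter):
        for mode in range(t.ndim):
            others = [factors[m] for m in range(t.ndim) if m != mode]
            kr = others[0]
            for f in others[1:]:
                kr = khatri_rao(kr, f)
            x = unfold(t, mode)
            # Lee-Seung style update for the squared-error objective.
            factors[mode] *= (x @ kr) / (factors[mode] @ (kr.T @ kr) + eps)
    return factors

# Hypothetical toy data: (verb, subject, object) -> co-occurrence count.
verbs = ["eat", "devour", "read", "skim"]
subjects = ["child", "student"]
objects = ["apple", "bread", "book", "paper"]
counts = {
    ("eat", "child", "apple"): 4, ("eat", "child", "bread"): 3,
    ("devour", "child", "apple"): 2, ("devour", "student", "bread"): 2,
    ("read", "student", "book"): 5, ("read", "student", "paper"): 3,
    ("skim", "student", "paper"): 2, ("skim", "child", "book"): 1,
}

tensor = np.zeros((len(verbs), len(subjects), len(objects)))
for (v, s, o), c in counts.items():
    tensor[verbs.index(v), subjects.index(s), objects.index(o)] = c

factors = nonneg_cp(tensor, rank=2)
verb_f, subj_f, obj_f = factors

# Rebuild the tensor from the factors and measure the relative error.
recon = np.einsum("ir,jr,kr->ijk", verb_f, subj_f, obj_f)
rel_err = np.linalg.norm(tensor - recon) / np.linalg.norm(tensor)

# Assign each verb to its strongest latent component (a crude clustering).
clusters = dict(zip(verbs, verb_f.argmax(axis=1).tolist()))
print(f"relative reconstruction error: {rel_err:.3f}")
print(clusters)
```

In this sketch the rows of the verb factor matrix act as soft cluster memberships, while the object factor matrix gives, for each latent component, a non-negative weighting over the fillers of that slot. A tensor for the task described in the abstract would need further modes or features, for example for the presence of each syntactic argument.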

Similar resources

Discovering Hidden Structure in High Dimensional Human Behavioral Data via Tensor Factorization

In recent years, the rapid growth in technology has increased the opportunity for longitudinal human behavioral studies. Rich multimodal data, from wearables like Fitbit, online social networks, mobile phones etc. can be collected in natural environments. Uncovering the underlying low-dimensional structure of noisy multi-way data in an unsupervised setting is a challenging problem. Tensor facto...

Multiobjective Optimization and Unsupervised Lexical Acquisition for Named Entity Recognition and Classification

In this paper, we investigate the utility of unsupervised lexical acquisition techniques to improve the quality of Named Entity Recognition and Classification (NERC) for the resource poor languages. As it is not a priori clear which unsupervised lexical acquisition techniques are useful for a particular task or language, careful feature selection is necessary. We treat feature selection as a mu...

Factorization of Multiple Tensors for Supervised Feature Extraction

Tensors are effective representations for complex and time-varying networks. The factorization of a tensor provides a high-quality low-rank compact basis for each dimension of the tensor, which facilitates the interpretation of important structures of the represented data. Many existing tensor factorization (TF) methods assume there is one tensor that needs to be decomposed to low-rank factors....

Mining Labelled Tensors by Discovering both their Common and Discriminative Subspaces

Conventional non-negative tensor factorization (NTF) methods assume there is only one tensor that needs to be decomposed to low-rank factors. However, in practice data are usually generated from different time periods or by different class labels, which are represented by a sequence of multiple tensors associated with different labels. This raises the problem that when one needs to analyze and ...

Multi-HDP: A Non Parametric Bayesian Model for Tensor Factorization

Matrix factorization algorithms are frequently used in the machine learning community to find low dimensional representations of data. We introduce a novel generative Bayesian probabilistic model for unsupervised matrix and tensor factorization. The model consists of several interacting LDA models, one for each modality. We describe an efficient collapsed Gibbs sampler for inference. We also de...

Publication date: 2012